Abstract. In this paper, we analyze the performance of random load resampling
and migration strategies in parallel server systems. Clients initially
attach to an arbitrary server, but may switch servers independently at random 
instants of time in an attempt to improve their service rate. This approach 
to load balancing contrasts with traditional approaches where clients make 
smart server selections upon arrival (e.g., Join-the-Shortest-Queue policy and
variants thereof). Load resampling is particularly relevant in scenarios where 
clients cannot predict the load of a server before being actually attached to 
it. An important example is in wireless spectrum sharing where clients try to 
share a set of frequency bands in a distributed manner. 

We first analyze the natural Random Local Search (RLS) strategy. Under 
this strategy, after sampling a new server randomly, clients only switch to it if 
their service rate is improved. In closed systems, where the client population is 
fixed, we derive tight estimates of the time it takes under the RLS strategy to
balance the load across servers. We then study open systems where clients arrive
according to a random process and leave the system upon service completion. 
In this scenario, we analyze how client migrations within the system inter- 
act with the system dynamics induced by client arrivals and departures. We 
compare the load-aware RLS strategy to a load-oblivious strategy in which 
clients just randomly switch servers without accounting for the server loads.
Surprisingly, we show that both load-oblivious and load-aware strategies sta- 
bilize the system whenever this is at all possible. We further demonstrate, 
using large-system asymptotics, that the average client sojourn time under 
the load-oblivious strategy is not considerably reduced when 
clients apply smarter load-aware strategies. 

1. Introduction 

Load balancing is a key component of today's communication networks and 
computer systems in which resources are distributed over a wide area or across 
a large number of systems and have to be shared by a large number of users. 
Load balancing enables efficient resource utilization and thereby tends to improve 
the quality of service perceived by users. Traditionally, load balancing has been 
achieved by applying smart routing policies: when a new demand arrives, it is 
routed towards a particular resource depending on the current loads of the various 
resources, see [14] and references therein. In contrast, we are interested in systems
where a new task is initially assigned to a resource chosen at random irrespective 
of the current resource loads, but where tasks can be re-assigned, i.e., migrate from 
one resource to another. 

Our primary motivation stems from the increasing popularity of Dynamic Spec- 
trum Access (DSA) techniques [1] as a potential mechanism for broadband access in 
future wireless systems. A common implementation platform for DSA is the use of
reprogrammable Software-Defined Radios (SDRs). These new radios are frequency-agile
or flexible, and have the ability to rapidly jump from one frequency band to
another in order to explore and exploit large parts of the spectrum. A central 
question in DSA is how multiple users may fairly and efficiently share spectrum in 
a distributed manner. Typically, the service rate of a user on a given frequency 
band is inversely proportional to the number of users transmitting on this band, 
i.e., to the load of the band. As new users entering the system have no way of 
determining the load on each frequency band, they have to initially select a band 
randomly. Should a user receive poor quality of service on a given band, she
may resample a new band at random and decide to switch to it. The overall system 
performance then strongly depends on the distributed resampling and switching 
strategy implemented by each user. 

Though our primary motivation is DSA, our methods and results could provide 
insight into a number of other applications. One such pertains to wireline networks, 
where there has recently been interest in multipath routing [15]. Here, users may
use several paths to download files, and have to select the appropriate path or the
set of paths. Another application is in transport networks, where one might wish 
to understand how Wardrop equilibria, which correspond to the equalization of 
journey times across alternative routes, are achieved or approximated by network 
users acting on limited information. Our results could also shed light on how
quickly such equilibria can be re-established following major disruptions or other 
changes to the network. Finally, note that distributed load resampling can also 
be thought of as a game between selfish users. In fact, it is an instance of a 
congestion game (see e.g. [IB]), and our results shed light on the time to reach a 
Nash equilibrium in such a game, and also help to understand the outcome of
the game with a dynamic population of players.

We consider a generic system consisting of multiple servers (in DSA, frequency 
bands) employing the Processor Sharing (PS) service discipline, shared by clients 
who have to initially pick a server at random, and may later resample servers and 
migrate during their service. We restrict our attention to two natural distributed 
resampling and migration strategies, the Random Local Search (RLS) and Random 
Load-Oblivious (RLO) strategies. When implementing the RLS algorithm, a user
resamples a new randomly chosen server at the instants of a Poisson process, and 
migrates to this new server if its load is smaller than that of the initial server. In 
contrast, under the RLO algorithm, a user hops between servers according to a 
random Markovian jump process irrespective of the loads of the visited servers. 

We investigate both closed systems with a fixed population of clients, and open
systems with a population whose dynamics are governed by client arrivals and the 
completions of their services. In closed systems, we are interested in characterizing 
the time that it takes under the RLS algorithm to balance all server loads (note that 
here the RLO algorithm does not balance loads except in an average sense - so we 
do not study this algorithm in closed systems). In open systems, users arrive at the 
various servers according to independent stochastic processes of fixed intensities, 
and leave upon service completion. In this scenario, client migrations within the 
system interact in a complicated manner with the system dynamics induced by 
client arrivals and departures. We aim at characterizing system stability under the 
RLS and RLO strategies, as well as at deriving estimates of user sojourn times.
Our contributions are as follows: 

• Closed systems. We show that, starting from an arbitrary allocation of
users to servers, the time $\tau$ it takes to achieve perfect balance of server
loads scales at most as $\log(m)\,(m^2/n + \log(m))$, where $m$ and $n$ denote the
number of servers and users, respectively. This considerably improves over
the existing bounds, which stated that $\tau$ scales at most as $m^2$ (see e.g. [12]).
We also investigate the time $\tau_\epsilon$ to reach an approximate $\epsilon$-balance (a system
reaches an approximate $\epsilon$-balance if there exists $\rho$ such that the number of
users associated to any server lies between $(1-\epsilon)\rho$ and $(1+\epsilon)\rho$). Achieving
such balance is much faster than reaching a perfect balance, and we show
that $\tau_\epsilon$ scales at most as $\log(m)/\epsilon$.

• Open systems. We demonstrate that both RLS and RLO strategies achieve
the largest stability region possible, i.e., that the system is stable under
these two algorithms provided that $\sum_i \lambda_i < \sum_i \mu_i$, where $\lambda_i$ denotes
the initial user arrival rate at server $i$ and $\mu_i$ is the service rate of this server.
The result is not surprising for RLS, but less intuitive for RLO since, under 
this algorithm, users take no account of server loads when migrating. For 
both RLS and RLO strategies, we derive approximate estimates of the 
average user sojourn time using large-system asymptotics. The estimates 
are shown to be exact when the number of servers grows large, but turn out 
to be quite accurate for systems of limited sizes as well. Our first numerical 
results suggest that again, surprisingly, the average client sojourn time 
under the load-oblivious RLO strategy is not considerably reduced when 
clients apply the smarter load-aware RLS strategy. To our knowledge, this
paper is the first to analyze the performance of RLS and RLO algorithms 
in open systems. 

The paper is organized as follows. In the next section, we describe our model and 
notation. Sections 3 and 4 are devoted to the analysis of closed and open systems, 
respectively. We give the related work in Section 5, and provide concluding remarks 
in Section 6. 

2. Model description and notation 

We consider a set of $m$ Processor Sharing servers of respective capacities $\mu_1, \ldots, \mu_m$.
The system is homogeneous if $\mu_i = 1$ for all $i = 1, \ldots, m$. The system state at
time $t$ is represented by the number of clients associated to each server, $N(t) =
(N_1(t), \ldots, N_m(t))$. The service rate of a client associated to server $i$ at time $t$ is
then $\mu_i/N_i(t)$. Clients independently resample and switch servers to selfishly improve
their service rate. They have a myopic view of the system in the sense that
they are aware of their current service rates, but do not know the service rate they 
would achieve at other servers. Given this myopic view, it is natural to consider and 
analyze the two following random distributed resampling and migration algorithms: 

• Random Local Search (RLS) algorithm. At the instants of a Poisson process
of intensity $\beta > 0$, a client picks a new server uniformly at random and
migrates to it if and only if this would increase her service rate. In other
words, if at time $t$, a client associated to server $i$ picks server $j$, she migrates
to $j$ if and only if $\mu_j/(N_j(t)+1) > \mu_i/N_i(t)$.

• Random Load-Oblivious (RLO) algorithm. After arriving in the system,
each client visits successive servers according to a continuous-time random
walk with transition matrix $Q = \{q_{ij},\ i,j = 1, \ldots, m\}$. The random walks
are independent across clients, and irreducible. We denote by $\pi$ the stationary
distribution of this random walk. Note that as a consequence of irreducibility,
clients visit all servers eventually, i.e., $\pi_i > 0$ for all $i = 1, \ldots, m$.

Note that under the RLO algorithm, clients do not take loads into account when 
switching servers. In particular they may move to a server with a higher load. An 
example of such a resampling strategy is as follows. Each client has a Poisson clock 
of rate $\beta > 0$ and, when her clock ticks, she picks a new server uniformly at random
and moves there irrespective of its load. 
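The two migration rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`rls_accepts`, `rls_step`, `rlo_uniform_step`) and the representation of the state as a list of per-server client counts are hypothetical choices, not taken from the paper, and the RLO step covers only the uniform-resampling example, not a general transition matrix.

```python
import random

def rls_accepts(mu_i, n_i, mu_j, n_j):
    """RLS rule: a client at server i (n_i clients, itself included) moves to
    the sampled server j only if its service rate would strictly increase,
    i.e. mu_j / (n_j + 1) > mu_i / n_i."""
    if n_i < 1:
        raise ValueError("the client itself must be counted in n_i")
    return mu_j / (n_j + 1) > mu_i / n_i

def rls_step(loads, mu, i, rng):
    """One resampling attempt by a client currently at server i.
    Updates `loads` in place and returns the client's new server."""
    j = rng.randrange(len(loads))          # uniform sample over all m servers
    if j != i and rls_accepts(mu[i], loads[i], mu[j], loads[j]):
        loads[i] -= 1
        loads[j] += 1
        return j
    return i

def rlo_uniform_step(loads, i, rng):
    """Uniform RLO example: move to the sampled server whatever its load."""
    j = rng.randrange(len(loads))
    loads[i] -= 1
    loads[j] += 1
    return j
```

In a homogeneous system the RLS test reduces to the load comparison used later in the closed-system analysis: the move is accepted only when the current server holds at least two more clients than the sampled one.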

We analyze the performance of distributed resampling and migration strategies 
in closed and open systems. In closed systems, the total population of clients is 
fixed, equal to n. For such systems, we investigate the time it takes under the 
RLS algorithm to balance clients across servers, starting from any arbitrary system 
state. In open systems, exogenous clients associate to server $i$ according to a Poisson
process of intensity $\lambda_i$ (the arrival processes are independent across servers). Client
service requirements are i.i.d. exponentially distributed with unit mean. Under
RLS and RLO algorithms, $(N(t), t \ge 0)$ is a Markov process. In open systems,
we are interested in characterizing the stability region of RLS and RLO strategies,
defined as the set of arrival rates $\lambda = (\lambda_1, \ldots, \lambda_m)$ such that the system is stable,
i.e., such that $(N(t), t \ge 0)$ is positive recurrent. We also aim at estimating the
average client sojourn time. 

3. Closed systems 

In this section, we analyze the performance of the RLS resampling strategy in 
a closed homogeneous system, and obtain tight bounds on the expected time to 
balance the server loads. 

Recall that there are $n$ clients distributed among $m$ servers. Let $n = qm + r$,
$0 \le r \le m-1$. We now define the following:

• The state $N(t) = (N_1(t), \ldots, N_m(t))$ is balanced if $|N_i(t) - N_j(t)| \le 1$ for
$1 \le i < j \le m$. The time to balance, $\tau$, is defined as

$$\tau := \inf\{t \ge 0 : N(t) \text{ is balanced}\}.$$

• The state $N(t)$ is $\epsilon$-balanced if $(1-\epsilon)\rho \le N_i(t) \le (1+\epsilon)\rho$ for all $i = 1, \ldots, m$,
where $\rho = n/m$. The time, $\tau_\epsilon$, to $\epsilon$-balance is defined as

$$\tau_\epsilon := \inf\{t \ge 0 : N(t) \text{ is } \epsilon\text{-balanced}\}.$$
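The two balance notions defined above translate directly into predicates on the load vector. The following is a minimal sketch; the function names and the list-of-counts representation are assumptions made for illustration.

```python
def is_balanced(loads):
    """Perfect balance: any two server loads differ by at most 1."""
    return max(loads) - min(loads) <= 1

def is_eps_balanced(loads, eps):
    """eps-balance: every load lies in [(1 - eps) * rho, (1 + eps) * rho],
    where rho = n / m is the mean load per server."""
    rho = sum(loads) / len(loads)
    return all((1 - eps) * rho <= x <= (1 + eps) * rho for x in loads)
```

For instance, the allocation `[3, 5]` has mean load 4, so it is 0.25-balanced, although it is not perfectly balanced.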

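Let $f, g : \mathbb{N} \to \mathbb{R}_+$. We say $f(k) = O(g(k))$ if there exist $k_0 \in \mathbb{N}$ and $c \in \mathbb{R}_+$
such that $f(k) \le c\,g(k)$ for all $k \ge k_0$. Similarly, for $f, g : \mathbb{N} \times \mathbb{N} \to \mathbb{R}_+$, we say
$f(k,l) = O(g(k,l))$ if there exist $k_0, l_0 \in \mathbb{N}$ and $c \in \mathbb{R}_+$ such that $f(k,l) \le c\,g(k,l)$
for all $k \ge k_0$ and $l \ge l_0$.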

3.1. Time to reach balance. We now characterize the time required by the RLS
algorithm to reach perfect balance and $\epsilon$-balance.

Theorem 3.1. The expected time, $E[\tau]$, for randomized local search to achieve
balance is $O(\log(m)\,(m^2/n + \log(m)))$.

Theorem 3.2. The expected time, $E[\tau_\epsilon]$, for randomized local search to achieve
$\epsilon$-balance is $O(\log(m)/\epsilon)$.

Remarks 

(1) It is easy to see, applying Markov's inequality, that the same upper bounds 
on the time to balance hold in probability as in expectation. 

(2) We now compare our bounds on $\tau$ with that from [12]. From Theorem 2.7
of [12], the expected number of attempted moves before reaching balance
is $O(m^2 n)$. Since move attempts (resampling) occur at rate $n$, this gives us
a time complexity of $O(m^2)$. Our bound in Theorem 3.1 is much tighter.

(3) Our bound is close to the best possible. To see this, suppose m divides 
n exactly. At some stage, the algorithm will reach an allocation in which 
one server has n/m + 1 clients, one other server has n/m — 1 clients and 
all others have exactly n/m clients. Each of the n/m + 1 clients at the 
overloaded server attempts to move at rate 1, and each move attempt is 
successful with probability 1/m. Hence, the mean time for just the final 
move is $m^2/(m+n) \ge m^2/(2n)$. Our bound is only a $\log m$ factor higher
than the time for the last move.

Alternatively, consider the situation when $m = o(n)$ and all $n$ clients
are initially at the same server. Then, at least $n - \lceil n/m \rceil$ clients need to
move out of this server to reach balance. When there are $k$ clients at the
server, the expected time to the next move is at least $1/k$ (possibly more,
as the move attempt may not be successful). Hence, the expected time to
reach balance is at least

$$\sum_{k=\lceil n/m \rceil + 1}^{n} \frac{1}{k} \approx \log m.$$

Again our bound is only a $\log m$ factor higher than the above lower bound
on the time to reach balance.
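The two lower-bound computations in Remark (3) are easy to check numerically. The sketch below assumes unit clock rate; the function names are hypothetical and the code only evaluates the two expressions quoted in the remark.

```python
import math

def final_move_mean_time(n, m):
    """Mean time of the last move when m divides n: the n/m + 1 clients of the
    overloaded server each attempt a move at rate 1 and succeed w.p. 1/m."""
    if n % m != 0:
        raise ValueError("this expression assumes that m divides n")
    return m * m / (n + m)

def harmonic_lower_bound(n, m):
    """Sum of 1/k for k from ceil(n/m) + 1 to n (all clients start together)."""
    start = -(-n // m) + 1                 # integer ceiling of n/m, plus one
    return sum(1.0 / k for k in range(start, n + 1))

def gap_to_log_m(n, m):
    """Absolute difference between the harmonic sum and log(m)."""
    return abs(harmonic_lower_bound(n, m) - math.log(m))
```

With `n = 10000` and `m = 10`, the harmonic sum differs from `log(10)` by well under one percent, which illustrates the approximation used in the remark.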

3.2. Proofs. Without loss of generality, we take the rate $\beta$ of the independent
Poisson clocks at each client to be unity. A client at server $i$ whose clock has ticked
at time $t$ attempts to move by sampling a server uniformly at random from all $m$
servers. It moves to the sampled server, say $j$, if and only if $N_i(t) - N_j(t) > 1$.
Clearly, $N(t)$ evolves as a continuous time Markov chain.
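The continuous-time Markov chain just described can be simulated event by event, which gives an empirical handle on the time to balance. The following is a minimal sketch for the homogeneous closed system; the function name `time_to_balance` and its interface are assumptions for illustration, not part of the paper.

```python
import random

def time_to_balance(loads, rng, beta=1.0):
    """Simulate RLS in a closed homogeneous system until perfect balance.

    `loads` is the initial number of clients per server. Returns the pair
    (elapsed time, final loads)."""
    loads = list(loads)
    n, m = sum(loads), len(loads)
    t = 0.0
    while max(loads) - min(loads) > 1:
        # The n independent clocks of rate beta tick at total rate beta * n.
        t += rng.expovariate(beta * n)
        # The ticking client is uniform among clients, so its server is
        # drawn with probability proportional to the server's load.
        i = rng.choices(range(m), weights=loads)[0]
        j = rng.randrange(m)               # uniformly sampled server
        if loads[i] - loads[j] > 1:        # move only if strictly better
            loads[i] -= 1
            loads[j] += 1
    return t, loads
```

Averaging the returned time over many independent runs, for example with all clients initially at one server, gives an estimate that can be compared against the scaling stated in Theorem 3.1.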

3.2.1. Proof of Theorem 3.1. Define $V(t) := \max_{1 \le j \le m} N_j(t)$, i.e., $V(t)$ is the
maximum number of clients associated with any server at time $t$. Define $C_v(t)$ to
be the number of servers with exactly $v$ clients, $B_v(t)$ to be the number with exactly
$v-1$ clients and $A_v(t)$ to be the number with strictly less than $v-1$ clients, all at
time $t$.

The idea of the proof is as follows. The evolution of $N(t)$ towards balance is
divided into phases. If $V(t) = v$, then $N(t)$ is said to be in phase $v$. Thus, $C_v(t)$ is
the number of maximally loaded servers in phase $v$. Since a client never moves to
a server that has more clients than its current server, $V(t)$ is monotone decreasing
and, in each phase, $C_v(t)$ is also monotone decreasing. Phase $v$ ends when $C_v(t) = 0$.
Let $T_v$ denote the (random) length of phase $v$. Each phase can be further divided
into sub-phases, say $(v,c)$, when $C_v(t) = c$. Let $T_{v,c}$ denote the random length of
time that it takes for $C_v(t)$ to decrease from $c$ to $c-1$. Observe that $T_v = \sum_c T_{v,c}$
and $\tau = \sum_v T_v$. When $N(t)$ is balanced, $V(t) = \lceil n/m \rceil$, $C_{\lceil n/m \rceil}(t) = r$ if $r > 0$ and
$C_{\lceil n/m \rceil}(t) = m$ otherwise. This gives us the maximum range for $v$. The number
of sub-phases in each phase is also similarly bounded. The theorem is proved by
bounding the expected times of each of the sub-phases and phases.

Proof. In phase $v$, observe that

$$v\,C_v(t) + (v-1)\,B_v(t) \le n \quad\text{and}\quad m - B_v(t) - C_v(t) = A_v(t).$$

Further, for $\lceil n/m \rceil < v \le \lceil n/(m-1) \rceil$, $n/(v-1) \in (m-1, m]$, but if $N(t)$ is not
balanced, then there has to be at least one server with $v-2$ or fewer clients (i.e.,
$A_v(t) \ge 1$). Hence

$$A_v(t) \ge \max\Big\{ m - \frac{n}{v-1},\ 1 \Big\}. \tag{1}$$

Each of the $v\,C_v(t)$ clients at one of the maximally loaded servers samples one of
the $m$ servers at random at unit rate. If the sampled server happens to be one of
the $A_v(t)$ servers with $v-2$ or fewer clients, then the client moves to the sampled
server and $C_v(t)$ decreases by 1. This event has probability $A_v(t)/m$. Hence, $C_v(t)$
decreases by 1 at a rate no smaller than $v\,C_v(t)\,A_v(t)/m$ and from (1), we obtain
that $T_{v,c} \le_{\mathrm{st}} \tilde{T}_{v,c} \sim \mathrm{Exp}(\lambda_{v,c})$ where

$$\lambda_{v,c} := \frac{vc}{m}\,\max\Big\{ m - \frac{n}{v-1},\ 1 \Big\} = vc\,\max\Big\{ 1 - \frac{n}{m(v-1)},\ \frac{1}{m} \Big\}. \tag{2}$$

Here we write $X \le_{\mathrm{st}} Y$ to mean that $X$ is stochastically dominated by $Y$ (i.e., for
all $t$, $P[X > t] \le P[Y > t]$), $X \sim Y$ to mean that they have the same distribution
and $\mathrm{Exp}(x)$ to denote an exponentially distributed random variable with rate $x$. In
particular,

$$E[T_{v,c}] \le E[\tilde{T}_{v,c}] \le \frac{1}{vc}\,\min\Big\{ \frac{m(v-1)}{[m(v-1)-n]^+},\ m \Big\}, \tag{3}$$

where $x^+$ denotes $\max\{x, 0\}$.

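At any time $t$, $C_v(t)$ is bounded above by $\lfloor n/v \rfloor$, since there cannot be more
than this many servers with $v$ clients. Since phase $v$ ends when $C_v(t) = 0$, we have
$E[T_v] \le \sum_{c=1}^{\lfloor n/v \rfloor} E[T_{v,c}]$, and we obtain

$$E[T_v] \le \frac{1}{v}\,\min\Big\{ \frac{m(v-1)}{[m(v-1)-n]^+},\ m \Big\} \sum_{c=1}^{\lfloor n/v \rfloor} \frac{1}{c}. \tag{4}$$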

Finally, $\tau$, the time it takes to reach perfect balance, satisfies

$$\tau \le \sum_{v=\lceil n/m \rceil + 1}^{n} T_v + \sum_{c=r+1}^{m} T_{\lceil n/m \rceil, c}. \tag{5}$$

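Now, we have by (3) that

$$\sum_{c=r+1}^{m} E[T_{\lceil n/m \rceil, c}] \le \sum_{c=r+1}^{m} \frac{m}{\lceil n/m \rceil\, c} \le \frac{m^2}{n} \sum_{c=r+1}^{m} \frac{1}{c} \le \frac{m^2}{n}\Big( 1 + \int_1^m \frac{1}{x}\,dx \Big) = \frac{m^2}{n}\,(1 + \log m). \tag{6}$$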
For all $v > \lceil n/m \rceil$, we also readily see that

$$\sum_{c=1}^{\lfloor n/v \rfloor} \frac{1}{c} \le 1 + \int_1^{n/v} \frac{1}{x}\,dx \le 1 + \log\frac{n}{v} \le 1 + \log m.$$

Hence, from (4), (5) and (6) we obtain

$$\frac{E[\tau]}{1 + \log m} \le \frac{m^2}{n} + \sum_{v=\lceil n/m \rceil + 1}^{\lceil n/(m-1) \rceil} \frac{m}{v} + \sum_{v=\lceil n/(m-1) \rceil + 1}^{n} \frac{m}{m(v-1) - n}. \tag{7}$$

The number of terms in the first sum above is at most $\max\{1, \frac{n}{m(m-1)}\}$. Each summand
is no more than $m^2/n$. Hence, the first sum is bounded above by $\max\{\frac{m^2}{n}, 2\}$.
The second sum is bounded above by

$$\frac{m(m-1)}{n} + \int_{n/(m-1)}^{n-1} \frac{m}{mx - n}\,dx = \frac{m(m-1)}{n} + \log\frac{(m-1)\,(m(n-1) - n)}{n}.$$

Substituting these expressions in (7) and simplifying, we get

$$E[\tau] \le (1 + \log m)\Big( \max\Big\{\frac{m^2}{n}, 2\Big\} + \frac{2m^2}{n} + \log(m^2) \Big) \le 3\,(1 + \log m)\Big( \frac{m^2}{n} + \log m + 1 \Big).$$

This completes the proof. □



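3.2.2. Proof of Theorem 3.2. We need the following definitions.

• Let $\rho = n/m$. Server $i$ is $\epsilon$-balanced at time $t$ if $(1-\epsilon)\rho \le N_i(t) \le
(1+\epsilon)\rho$, underloaded if $N_i(t) < (1-\epsilon)\rho$ and overloaded if $N_i(t) > (1+\epsilon)\rho$.
$M_c(t)$, $M_u(t)$ and $M_o(t)$ denote the number of $\epsilon$-balanced, underloaded
and overloaded servers, respectively.

• The underflow from server $i$ is defined to be

$$U_i(t) := \begin{cases} 0 & \text{if } N_i(t) \ge \rho, \\ \rho - N_i(t) & \text{otherwise.} \end{cases}$$

Also, let $U(t) := \sum_{i=1}^{m} U_i(t)$. Similarly, define the overflow from server $i$ as

$$O_i(t) := \begin{cases} 0 & \text{if } N_i(t) \le \rho, \\ N_i(t) - \rho & \text{otherwise,} \end{cases}$$

and $O(t) := \sum_{i=1}^{m} O_i(t)$.
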
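Proof. Let $N_o(t)$ be the number of 'overflowing' clients defined as

$$N_o(t) := \sum_{i \in \mathcal{M}_o(t)} \big( N_i(t) - (1+\epsilon)\rho \big),$$

where $\mathcal{M}_o(t)$ is the set of overloaded servers at time $t$. We can write

$$U(t) \le (M_u(t) \times \rho) + M_c(t) \times (\epsilon\rho),$$
$$O(t) \ge N_o(t) + (m - M_u(t) - M_c(t)) \times (\epsilon\rho).$$

Since $O(t) = U(t)$, we obtain

$$\rho\,M_u(t) + (\epsilon\rho)\,M_c(t) \ge N_o(t) + (m - M_u(t) - M_c(t))\,(\epsilon\rho),$$

which yields

$$N_o(t) \le \rho\,\big( (M_u(t) + M_c(t))(1+\epsilon) - \epsilon m \big),$$
$$M_c(t) + M_u(t) \ge \frac{N_o(t)}{(1+\epsilon)\rho} + \frac{\epsilon m}{1+\epsilon} \ge \max\Big\{ \frac{N_o(t)}{(1+\epsilon)\rho},\ \frac{\epsilon m}{1+\epsilon} \Big\}.$$
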
Now consider a client that is attempting to move at time $t$. We say that this attempt
results in a good move if the attempt results in a migration that reduces $N_o(t)$. Let
$G$ denote the event corresponding to a good move. When the state of the system
is $(N_o, M_c, M_u)$, the probability of a good move is

$$P(G) \ge \frac{N_o + (1+\epsilon)\rho}{n} \cdot \frac{M_c + M_u}{m},$$

and the number of attempts between successive good moves is geometric with mean
at most $\frac{mn}{(N_o + (1+\epsilon)\rho)(M_c + M_u)}$.

Let $K_G$ denote the number of attempts before a good move occurs from the state
$(N_o, M_c, M_u)$. The expected number of attempts before a good move reduces $N_o$
satisfies

$$E[K_G] \le \frac{mn}{(N_o + (1+\epsilon)\rho)(M_c + M_u)}.$$

Let $K_\epsilon$ denote the number of attempts to achieve $\epsilon$-balance. In the worst case,
$N_o(t)$ starts at $(m-1)\rho$ and ends at 1. We can then bound $E[K_\epsilon]$ as follows.



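$$\begin{aligned}
E[K_\epsilon] &\le \sum_{i=1}^{(m-1)\rho} \frac{mn}{\big((1+\epsilon)\rho + i\big)\,\max\big\{\frac{i}{(1+\epsilon)\rho},\ \frac{\epsilon m}{1+\epsilon}\big\}} \\
&\le \sum_{i=1}^{\epsilon n} \frac{mn}{\big((1+\epsilon)\rho + i\big)\,\frac{\epsilon m}{1+\epsilon}} + \sum_{i=\epsilon n+1}^{(m-1)\rho} \frac{mn}{\big((1+\epsilon)\rho + i\big)\,\frac{i}{(1+\epsilon)\rho}} \\
&= \frac{(1+\epsilon)\,n}{\epsilon} \sum_{i=1}^{\epsilon n} \frac{1}{(1+\epsilon)\rho + i} + mn\,(1+\epsilon)\rho \sum_{i=\epsilon n+1}^{(m-1)\rho} \frac{1}{i\,\big((1+\epsilon)\rho + i\big)} \\
&\le \frac{(1+\epsilon)\,n}{\epsilon} \log\Big(\frac{(1+\epsilon)\rho + \epsilon n}{(1+\epsilon)\rho}\Big) + mn \log\Big(\frac{(m-1)\rho}{\epsilon n} \cdot \frac{(1+\epsilon)\rho + \epsilon n}{(1+\epsilon)\rho + (m-1)\rho}\Big) \\
&\le \frac{(1+\epsilon)\,n}{\epsilon} \log(1 + \epsilon m) + mn \log\Big(\Big(1 - \frac{1}{m}\Big)\Big(1 + \frac{1+\epsilon}{\epsilon m}\Big)\Big(1 + \frac{\epsilon}{m}\Big)^{-1}\Big) \\
&= O\Big(\frac{n}{\epsilon}\log(m)\Big).
\end{aligned}$$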

Since each client is sampling at unit rate, the total sampling rate is $n$ and the
average time to reach $\epsilon$-balance, $E[\tau_\epsilon]$, is $E[K_\epsilon]/n$. Thus $E[\tau_\epsilon] = O((\log m)/\epsilon)$. □

4. Open systems 

In open systems, we are interested in quantifying classical queueing performance 
metrics, such as the stability region and the mean client sojourn time. We first 
investigate the stability region achieved under RLO and RLS algorithms. Both 
algorithms are shown to stabilize the system whenever this is at all possible, which 
for the load-oblivious RLO algorithm may be surprising. Then, we try to obtain more
detailed estimates of the system performance. As it turns out, the system equilib- 
rium distribution is difficult, if not impossible, to derive, and we rely on large-system 
asymptotics to provide insights into the way the system behaves. 
4.1. Stability. In the following, we denote by $\lambda = (\lambda_1, \ldots, \lambda_m)$ and $\mu = (\mu_1, \ldots, \mu_m)$
the vectors representing the arrival and departure rates at the various servers. $\|\cdot\|$
denotes the $\ell_1$-norm on $\mathbb{R}^m$. We first provide an upper bound on the maximum
stability region defined as the set of A such that there may exist a resampling and 
migration strategy stabilizing the system. This set is obtained by assuming that 
all servers' resources are pooled. 

Proposition 4.1. Assume that $\lambda$ is such that $\sum_i \lambda_i > \sum_i \mu_i$. Then there is no
resampling and migration strategy stabilizing the system.

Proof. The proof is straightforward. Remark that for any resampling and migration
strategy, the total service rate is at most $\sum_i \mu_i$. Then if $\sum_i \lambda_i > \sum_i \mu_i$, the average
number of clients in the system grows at a rate of at least $\sum_i \lambda_i - \sum_i \mu_i > 0$.
The system is then unstable. □
The two following theorems state that both RLO and RLS strategies achieve 
maximum stability. 

Theorem 4.1. Assume that $\sum_i \lambda_i < \sum_i \mu_i$. Then the system is stable under the RLO
algorithm.

Theorem 4.2. Assume that $\sum_i \lambda_i < \sum_i \mu_i$. Then the system is stable under the RLS
algorithm.

A result somewhat similar to that of Theorem 4.1 was first stated in [7] using
heuristic fluid limit arguments. Fluid limits are powerful techniques to study
ergodicity of Markov processes [S] . They comprise the study of the system behavior 
in the following limiting regime: the initial condition is scaled up by a multiplicative
factor $k$, time is accelerated by the same factor, and $k$ tends to $\infty$. Often the
system becomes tractable in this regime and even deterministic. If the system in
the fluid regime reaches 0 in finite time, then the process is ergodic. In the
fluid regime, clients stay for very long periods of time in our system, and since,
under the RLO algorithm, the client random walks are ergodic, the probability that a
given client is associated to server $i$ should be proportional to $\pi_i$ (the equilibrium
distribution of the random walk). In such a case, when the client population is
large (as in the fluid regime), all servers should be occupied and active, ensuring
that the system empties in finite time. This is the argument used in [7], but not
justified. The problem arises because the client migration process actually interacts
with arrivals and departures. Handling this interaction turns out to be extremely
difficult. Recently however, in [21], the authors were able to formally derive the
system fluid limits, and analyze its stability under very specific assumptions on the
client random walk (its transition matrix $Q$ has to be diagonalizable). Their proof
is quite intricate. In the following, we prove Theorem 4.1 without the use of fluid
limits, and for any random walk. Our proof is much more direct than that in [21],
and hence is amenable to more general cases and possible extensions. For
the proof of Theorem 4.2 we use a rather classic method, i.e., we exhibit a simple
Lyapunov function.

4.1.1. Proof of Theorem 4.1. Recall that by definition, under the RLO strategy, the
process $(N(t), t \ge 0)$ is the Markov process with the following non-zero transition
rates for $1 \le i \ne j \le m$:

$$\begin{cases} \Omega(n, n + e_i) = \lambda_i, \\ \Omega(n, n - e_i + e_j) = n_i\,q_{ij}, \\ \Omega(n, n - e_i) = \mu_i\,1_{\{n_i > 0\}}, \end{cases} \tag{8}$$

where $n = (n_1, \ldots, n_m) \in \mathbb{N}^m$ and $e_i$ is the $m$-dimensional vector with every
coordinate equal to 0, except for the $i$th one equal to 1. The matrix $Q = (q_{ij})$
describes the migration of clients, and it is only assumed to possess a unique stationary
distribution $\pi = (\pi_i)$ such that $\pi_i > 0$ for each $i = 1, \ldots, m$. The aim of
the analysis is to use the following result, known as Foster's criterion [20].
(Foster's criterion) If there exist $K$ and $t > 0$ such that

$$\sup_{n \in \mathbb{N}^m :\, \|n\| > K} E_n(\|N(t)\| - \|n\|) < 0, \tag{9}$$

where $E_n(\cdot) = E(\cdot \mid N(0) = n)$, then $(N(t), t \ge 0)$ is ergodic.
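To make the RLO dynamics concrete, the non-zero transition rates of the open system, and the instantaneous rate of change of the total population that the drift in Foster's criterion accumulates over time, can be enumerated as follows. This is a sketch only: the function names and the tuple representation of states are assumptions for illustration.

```python
def rlo_transitions(n, lam, mu, q):
    """Return the list of (next_state, rate) pairs for the non-zero
    transitions of the open RLO system out of state n (a tuple of counts)."""
    m = len(n)
    out = []
    for i in range(m):
        up = list(n)
        up[i] += 1
        out.append((tuple(up), lam[i]))            # arrival at server i
        if n[i] > 0:
            down = list(n)
            down[i] -= 1
            out.append((tuple(down), mu[i]))       # service completion at i
            for j in range(m):
                if j != i and q[i][j] > 0:
                    moved = list(n)
                    moved[i] -= 1
                    moved[j] += 1
                    # each of the n[i] clients at i jumps to j at rate q[i][j]
                    out.append((tuple(moved), n[i] * q[i][j]))
    return out

def population_drift(n, lam, mu, q):
    """Instantaneous expected rate of change of the total number of clients."""
    return sum(rate * (sum(s) - sum(n)) for s, rate in rlo_transitions(n, lam, mu, q))
```

Migrations leave the total population unchanged, so `population_drift` equals the total arrival rate minus the total capacity of the non-empty servers. This is why the probability that all servers are busy is the key quantity in the argument.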

Kolmogorov's equation is the first step that leads to (9): for any $t \ge 0$, the drift
$E_n(\|N(t)\| - \|n\|)$ is given by

$$E_n(\|N(t)\| - \|n\|) = \|\lambda\|\,t - \int_0^t E_n\Big[ \sum_{i=1}^{m} \mu_i\,1_{\{N_i(u) > 0\}} \Big]\,du.$$

This gives the following inequality, which is the basis of our drift analysis:

$$E_n(\|N(t)\| - \|n\|) \le \|\lambda\|\,t - \|\mu\| \int_0^t P_n(N(u) > 0)\,du, \tag{10}$$

where $P_n[\cdot] = P[\cdot \mid N(0) = n]$, and for $x \in \mathbb{N}^m$, $x > 0$ is to be understood coordinate-wise,
i.e., $x_i > 0$ for each $i = 1, \ldots, m$.

The idea of the proof of (9) is that when the system starts with many clients,
then the number of arrivals and departures is negligible on the time interval $[0, t]$
and the system behaves like the closed one. For a closed system, it is not difficult to
show, using the fact that $Q$ has an invariant measure, that $P(N(u) > 0)$ for $u > 0$
is arbitrarily close to 1 as the number of clients in the system increases. In view
of (10) this gives a negative drift when $\|\lambda\| < \|\mu\|$.

The following coupling initially proposed and formally justified in [21] is key to
relate the open and closed systems. For $n \in \mathbb{N}^m$ and $\ell, p \in \mathbb{R}_+^m$, denote by $N^n_{\ell,p}$ the
process under the RLO strategy starting in the initial state $n$, with arrival rate $\ell_i$ at
server $i$ with capacity $p_i$. Then $(N^n_{\ell,p}(t))$ is the Markov process with $N^n_{\ell,p}(0) = n$,
and with non-zero transition rates given by (8) with $\ell_i$ instead of $\lambda_i$ and $p_i$ instead
of $\mu_i$. Then the processes $N^n_{\ell,0}$ and $N^0_{p,0}$ can be coupled in such a way that for some
process $Z(t) \ge 0$,

$$N^n_{\ell,p}(t) = N^n_{\ell,0}(t) - N^0_{p,0}(t) + Z(t), \quad t \ge 0.$$

Moreover, the processes $N^n_{\ell,0}$ and $\|N^0_{p,0}\|$ are independent, and $\|N^0_{p,0}\|$ is a Poisson
process with parameter $\|p\|$. Essentially, this coupling realizes the process $N^n_{\ell,p}$ with
arrivals and departures as the difference between two processes without departures.
This coupling can be constructed as follows: consider a particle system with three
kinds of particles, colored blue, red and green. All the particles in the system are
performing independent continuous-time random walks, going from $i$ to $j$ at rate
$q_{ij}$, and the system starts with only blue particles.



LOAD BALANCING VIA RANDOM LOCAL SEARCH IN CLOSED AND OPEN SYSTEMS 11 



Consider two independent Poisson processes $N_\ell$ and $N_p$ with respective parameters
$\|\ell\|$ and $\|p\|$: at times of $N_\ell$, add a new blue particle at server $i$ with probability
$\ell_i/\|\ell\|$. At times of $N_p$, consider server $i$ with probability $p_i/\|p\|$: if there is a blue
particle at this server, choose one at random and turn it into a red one. If there is no blue
particle, add a green particle.

If Bi{t), Ri{t) and Gi{t) are respectively the number of blue, red and green 
particles at server i at time t, then it is easy to see that: 

• $B$ is distributed like $N^n_{\ell,p}$,

• $B + R$ is distributed like $N^n_{\ell,0}$,

• $R + G$ is distributed like $N^0_{p,0}$ and $\|R + G\| = N_p$ is independent of $B + R$.
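The blue/red/green construction can be simulated directly, which makes the counting identities behind the coupling easy to check. The sketch below tracks only the number of particles of each colour at each server (enough here, since all particles follow the same random walk); the function name `simulate_coupling`, its arguments and its return value are assumptions made for illustration, and it assumes at least one of the arrival or capacity rates is positive.

```python
import random

def simulate_coupling(n0, ell, p, q, horizon, rng):
    """Run the blue/red/green particle system up to time `horizon`.

    Returns (B, R, G, arrivals, removals): per-server colour counts and the
    numbers of events of the two Poisson processes N_ell and N_p."""
    m = len(n0)
    B, R, G = list(n0), [0] * m, [0] * m
    arrivals = removals = 0
    t = 0.0
    while True:
        # Enumerate every possible event together with its rate.
        events = [(ell[i], "arrive", i) for i in range(m)]
        events += [(p[i], "remove", i) for i in range(m)]
        for counts in (B, R, G):           # every particle walks at rates q[i][j]
            for i in range(m):
                for j in range(m):
                    if i != j:
                        events.append((counts[i] * q[i][j], "move", (counts, i, j)))
        weights = [w for w, _, _ in events]
        t += rng.expovariate(sum(weights))
        if t > horizon:
            return B, R, G, arrivals, removals
        _, kind, arg = rng.choices(events, weights=weights)[0]
        if kind == "arrive":               # event of N_ell: new blue particle
            B[arg] += 1
            arrivals += 1
        elif kind == "remove":             # event of N_p at server `arg`
            removals += 1
            if B[arg] > 0:
                B[arg] -= 1                # a blue particle turns red
                R[arg] += 1
            else:
                G[arg] += 1                # no blue particle: add a green one
        else:
            counts, i, j = arg
            counts[i] -= 1
            counts[j] += 1
```

In any run, the total number of blue and red particles equals the initial population plus the number of events of $N_\ell$, and the total number of red and green particles equals the number of events of $N_p$, which is the bookkeeping behind the three distributional identities above.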

This proves the coupling with $Z(t) = G(t)$. The process $N^n_{\ell,0}$ can be seen as the
superposition of the initial particles with the particles arriving at rate $\ell$, hence
the additional coupling $N^n_{\ell,0} = N^n_{0,0} + N^0_{\ell,0}$ holds, and finally, $N^n_{\ell,p}$ can be written

$$N^n_{\ell,p}(t) = N^n_{0,0}(t) + N^0_{\ell,0}(t) - N^0_{p,0}(t) + Z(t), \quad t \ge 0,$$

with $N^n_{0,0}$ and $\|N^0_{p,0}\|$ independent, and $Z(t) \ge 0$. Starting from (10), we now turn
our attention to proving the existence of constants $K$ and $t$ which satisfy (9). We
have, using the coupling's notation, $P_n(N(u) > 0) = P(N^n_{\lambda,\mu}(u) > 0)$ and hence,
for any $0 \le u \le t$ and $n \in \mathbb{N}^m$,

$$P_n(N(u) > 0) = P\big( N^n_{0,0}(u) + N^0_{\lambda,0}(u) + Z(u) > N^0_{\mu,0}(u) \big) \ge P\big( N^n_{0,0}(u) > \|N^0_{\mu,0}(t)\| \big).$$

Since the process $N^n_{0,0}$ is independent of the random variable $\|N^0_{\mu,0}(t)\|$, we can work
conditionally on the value of $\|N^0_{\mu,0}(t)\|$ and study the quantity $P(N^n_{0,0}(u) > M)$.
Thus we only need to consider the closed process $N^n_{0,0}$ henceforth, and so we simplify
the notation and write $N^n_{0,0} = N^n$. Markov's inequality gives

$$P(\exists i \in \{1,\ldots,m\} : N^n_i(u) \le M) = 1 - P(N^n(u) > M) \le \sum_{i=1}^m P(N^n_i(u) \le M) \le e^M \sum_{i=1}^m E\big( e^{-N^n_i(u)} \big).$$

For any $i \in \{1, \ldots, m\}$,

$$E\big( e^{-N^n_i(u)} \big) = \prod_{j=1}^m \Big[ E_j\big( e^{-1_{\{\xi(u) = i\}}} \big) \Big]^{n_j},$$

where $\xi$ under $P_j$ is a continuous-time Markov chain with transition rates $Q = (q_{ij})$,
and which starts at $\xi(0) = j$. If $p(j,i,u) = P_j(\xi(u) = i)$, one gets for $u \ge t_0 > 0$
and $n \in \mathbb{N}^m$ with $\|n\| \ge K$

$$E\big( e^{-N^n_i(u)} \big) = e^{\sum_{j=1}^m n_j \log(1 - (1 - 1/e)\,p(j,i,u))} \le e^{-\|n\|(1-1/e)\,p(t_0)} \le e^{-K(1-1/e)\,p(t_0)},$$

with $p(t_0) = \inf_{u \ge t_0} \min_{1 \le i,j \le m} p(j,i,u)$. Note that since, for any $1 \le i,j \le m$,
$p(j,i,u) > 0$ for any $u > 0$ and $p(j,i,u) \to \pi_i > 0$ as $u \to +\infty$, one has that
$p(t_0) > 0$. Therefore, for $u \ge t_0$ and $n$ with $\|n\| \ge K$, integrating on the law of
$\|N^0_{\mu,0}(t)\|$ gives

$$P\big( N^n(u) > \|N^0_{\mu,0}(t)\| \big) \ge 1 - \varepsilon(t, K, t_0)$$
with $\varepsilon(t,K,t_0) = m\,e^{\|\mu\| t (e-1) - K(1-1/e)\,p(t_0)}$. In particular, for $t \ge t_0$,

$$\sup_{n \in \mathbb{N}^m :\, \|n\| > K} E_n(\|N(t)\| - \|n\|) \le \|\lambda\|\,t - \|\mu\|\,(t - t_0)\,(1 - \varepsilon(t,K,t_0)).$$

Since by assumption $\|\lambda\| < \|\mu\|$ it is not difficult to choose constants $t$, $t_0$ and
$K$ such that the right hand side is strictly negative (for instance, $t = 1$, $t_0$ small
enough and $K$ large enough), which gives the result.
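The final choice of constants can be illustrated numerically by evaluating the right-hand side of the last bound. In the sketch below the function names are hypothetical, and `p_t0` stands for the quantity $p(t_0)$, which depends on the random walk and is simply supplied as an assumed positive number.

```python
import math

def epsilon_bound(t, K, p_t0, mu_norm, m):
    """The error term m * exp(||mu|| t (e - 1) - K (1 - 1/e) p(t0))."""
    return m * math.exp(mu_norm * t * (math.e - 1) - K * (1 - 1 / math.e) * p_t0)

def drift_upper_bound(t, t0, K, p_t0, lam_norm, mu_norm, m):
    """Right-hand side ||lam|| t - ||mu|| (t - t0) (1 - epsilon(t, K, t0))."""
    eps = epsilon_bound(t, K, p_t0, mu_norm, m)
    return lam_norm * t - mu_norm * (t - t0) * (1 - eps)
```

As an example with made-up values, take two servers with total arrival rate 1.5, total capacity 2, `t = 1`, `t0 = 0.1` and `p_t0 = 0.05`. With `K = 2000` the error term is negligible and the bound is close to `1.5 - 2 * 0.9`, which is negative; with `K = 0` the error term dominates and the bound is positive, so a large `K` is indeed needed.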

4.1.2. Proof of Theorem 4.2. Intuitively, it is clear that the load-dependent RLS
strategy performs better than the load-oblivious RLO policy, since it seems harder
under RLS to see an empty server. This simple observation shows that the number
of empty servers should be part of a Lyapunov function, and indeed this leads us
to define the function $f : \mathbb{N}^m \to \mathbb{R}_+$ by:

$$\forall n, \quad f(n) = \sum_{i=1}^m \max(\epsilon, n_i) = \|n\| + \epsilon\,k_0(n),$$

with $k_0(n) = 1_{\{n_1 = 0\}} + \cdots + 1_{\{n_m = 0\}}$ the number of empty servers in state $n$. In
order for $f$ to be a Lyapunov function, the constant $0 < \epsilon < 1$ has to satisfy

$$\epsilon \times \sum_i \mu_i \le \sum_i (\mu_i - \lambda_i) - \gamma$$

for some $\gamma > 0$.

Let $K_0(n)$ (resp. $K_1(n)$) be the set of servers that are empty (resp. have a single
client). Denote by $k_0(n)$ and $k_1(n)$ the respective cardinalities of these sets. Let us
compute the average drift $\Delta f(n)$ of the Markov process $N(t)$ under the RLS strategy.
We have:

\[ \Delta f(n) = \sum_i \lambda_i - \sum_{i \notin K_0(n)} \mu_i + \epsilon \Big( \sum_{i \in K_1(n)} \mu_i - \sum_{i \in K_0(n)} \lambda_i - Y(n) \Big), \]

where $Y(n)$ is the rate in state $n$ at which empty servers are fed by migrating
clients.

• If $k_0(n) = 0$, there are no empty servers in state $n$ and in particular $Y(n) = 0$.
We have:

\[ \Delta f(n) = \sum_i (\lambda_i - \mu_i) + \epsilon \sum_{i \in K_1(n)} \mu_i \le -\gamma, \]

because of our choice of $\epsilon$.

• If $k_0(n) > 0$, there is at least one empty server in state $n$. Define $p(n) =
\max_i n_i$. Considering migrations of the $p(n)$ clients from (one of) the
server(s) with maximum size to one of the empty servers, we obtain: $Y(n) \ge
\beta p(n)/m$, which ensures that $\Delta f(n) \le -\gamma$ when $p(n)$ is large enough, say
greater than $K$.

We conclude the proof by considering the drift outside the set $F = \{n : f(n) \le
m(K + \epsilon)\}$. First remark that $F$ is finite. Then, when $n \notin F$, $p(n) > K$. We deduce
that for all $n \notin F$: $\Delta f(n) \le -\gamma$. The positive recurrence follows.
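To make the Lyapunov argument concrete, the following Python sketch evaluates the function f, the largest admissible constant for a given margin, and the drift in states without empty servers. The function names and the numerical values are illustrative assumptions, not part of the original analysis, and the drift formula only covers the case where no server is empty.

```python
def lyapunov(n, eps):
    """f(n) = sum_i max(eps, n_i), equal to ||n|| + eps * (number of empty servers)."""
    return sum(max(eps, n_i) for n_i in n)


def largest_eps(lam, mu, gamma):
    """Largest eps satisfying eps * sum(mu) <= sum(mu - lam) - gamma."""
    return (sum(mu) - sum(lam) - gamma) / sum(mu)


def drift_without_empty_server(n, lam, mu, eps):
    """Drift of f in a state with no empty server, where Y(n) = 0."""
    if any(n_i == 0 for n_i in n):
        raise ValueError("this formula only covers states without empty servers")
    # Only servers holding a single client can become empty after a departure.
    mu_single = sum(mu_i for n_i, mu_i in zip(n, mu) if n_i == 1)
    return sum(lam) - sum(mu) + eps * mu_single


if __name__ == "__main__":
    # Hypothetical rates chosen so that sum(lam) < sum(mu).
    lam, mu, gamma = [0.5, 0.2, 0.3], [1.0, 1.0, 1.0], 0.1
    eps = largest_eps(lam, mu, gamma)
    print("eps =", eps)
    print("f((0, 2, 1)) =", lyapunov((0, 2, 1), eps))
    print("drift at (1, 1, 1) =", drift_without_empty_server((1, 1, 1), lam, mu, eps))
```

The worst case among states without empty servers is the one where every server holds exactly one client, which is why the constraint on the constant involves the sum of all service rates.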
4.2. Approximate performance estimates. The system behavior in the stationary
regime under the RLO and RLS strategies is extremely difficult to analyze. For example,
$(N(t), t \ge 0)$ is unfortunately not reversible under these strategies. To obtain
estimates of the steady-state distribution and client sojourn times, we use large-system
asymptotics, i.e., we let $m$ grow large. Recently, large-system asymptotics
have been successfully applied in many contexts in communication systems. They
have been used for example to understand load balancing issues such as those arising
in the supermarket model [13, 17]. In the rest of the section, we denote by
$N^{(m)}(t)$ the vector representing the numbers of clients at time $t$ at each server in a
system with $m$ servers under either the RLO or the RLS algorithm.

In what follows, we consider homogeneous systems where $\lambda_i = \lambda$ and $\mu_i = 1$ for
all $i$. This restriction simplifies the notation and results, but is not essential. We
discuss at the end of this subsection how to deal with heterogenous systems. We
also assume that the number of clients associated with a given server is bounded by
a (possibly very large) constant $B$. Again this assumption is not crucial, and can
be relaxed at the expense of a more involved analysis.



4.2.1. RLO algorithm. We first consider RLO algorithms. We assume that a client
jumps from one server to another at the instants of a Poisson process of intensity
$\beta$, and that the next server is chosen uniformly at random. The analysis can be
extended to any random walk (see §4.2.3). We represent the system state at time
$t$ by $X_k^{(m)}(t)$, the proportion of servers with exactly $k$ clients at time $t$. We also
define $S_k^{(m)}(t) = \sum_{j \ge k} X_j^{(m)}(t)$.

Let us compute the average change in the system state in a small interval of
time of duration $dt$, and more specifically the change in $X_k^{(m)}$. Arrivals occur at
rate $\lambda m$: an arrival increases $X_k^{(m)}$ if it occurs at a server with $k - 1$ clients, and
decreases $X_k^{(m)}$ if it occurs at a server with $k$ clients. Hence the change in $X_k^{(m)}$ due
to exogenous arrivals is $dt\,\lambda (X_{k-1}^{(m)} - X_k^{(m)})$. Departures can be analyzed similarly.
Let us now compute the change due to client migrations. Clients migrating to a
server with $k - 1$ (resp. $k$) clients increase (resp. decrease) $X_k^{(m)}$. In addition,
clients migrating from servers with $k$ (resp. $k + 1$) clients decrease (resp. increase)
$X_k^{(m)}$. The average change in $X_k^{(m)}$ due to client migrations is thus:
$dt\,\beta \big( (X_{k-1}^{(m)} - X_k^{(m)}) \sum_j j X_j^{(m)} - k X_k^{(m)} + (k+1) X_{k+1}^{(m)} \big)$. In summary, the average change in $X_k^{(m)}$
during $dt$ is:

\[ dt \times \Big[ \lambda \big(X_{k-1}^{(m)} - X_k^{(m)}\big) - \big(X_k^{(m)} - X_{k+1}^{(m)}\big) + \beta \Big( \big(X_{k-1}^{(m)} - X_k^{(m)}\big) \sum_j j X_j^{(m)} - k X_k^{(m)} + (k+1) X_{k+1}^{(m)} \Big) \Big]. \]

There is no explicit dependence on $m$, and hence we expect the dynamics of $X_k^{(m)}(t)$
to be close to those of a deterministic solution $x_k$ of the following set of differential
equations: for all $k \in \{0, \ldots, B\}$,

\[ \dot{x}_k = \lambda (x_{k-1} - x_k) - (x_k - x_{k+1}) + \beta \Big[ (x_{k-1} - x_k) \sum_j j x_j - k x_k + (k+1) x_{k+1} \Big], \tag{11} \]






with the convention that $x_{-1} = 0 = x_{B+1}$. We may write similar differential
equations for the evolution of $S_k^{(m)}$. We obtain: for all $k = 0, \ldots, B$,

\[ \dot{s}_k = \Big( \lambda + \beta \sum_{i \ge 1} s_i \Big)(s_{k-1} - s_k) - (1 + \beta k)(s_k - s_{k+1}), \tag{12} \]

with the convention that $s_{-1} = 0 = s_{B+1}$. Next we formally justify the above
analysis and show that (11) gives an estimate of the system behavior that becomes
exact when $m \to \infty$.
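The deterministic dynamics (11) can be explored numerically. The Python sketch below is a minimal illustration, not the authors' code: it writes the right-hand side as flows between neighbouring occupancy levels and integrates with an explicit Euler scheme. The function names, the step size, the parameter values, and the boundary handling (no arrival at a server that already holds B clients, no departure from an empty server) are assumptions made here so that the total mass stays equal to 1.

```python
def rlo_drift(x, lam, beta):
    """Right-hand side of the RLO mean-field equations for x = (x_0, ..., x_B).

    Written as flows between neighbouring occupancy levels. Assumption of this
    sketch: a full server (B clients) accepts no further client, and an empty
    server has no departure, so that sum(x) stays equal to 1.
    """
    B = len(x) - 1
    y = sum(j * x_j for j, x_j in enumerate(x))  # mean number of clients per server
    gain = lam + beta * y                        # rate at which a server gains a client
    dx = [0.0] * (B + 1)
    for k in range(B):                           # servers moving from k to k + 1 clients
        flow = gain * x[k]
        dx[k] -= flow
        dx[k + 1] += flow
    for k in range(1, B + 1):                    # servers moving from k to k - 1 clients
        flow = (1.0 + beta * k) * x[k]           # service completion, or one of k clients leaving
        dx[k] -= flow
        dx[k - 1] += flow
    return dx


def integrate(x0, lam, beta, t_end, dt=0.01):
    """Explicit Euler integration from x0 up to time t_end."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        dx = rlo_drift(x, lam, beta)
        x = [x_k + dt * d_k for x_k, d_k in zip(x, dx)]
    return x


if __name__ == "__main__":
    B, lam, beta = 30, 0.5, 0.5                  # illustrative values
    x = integrate([1.0] + [0.0] * B, lam, beta, t_end=200.0)
    mean_clients = sum(k * x_k for k, x_k in enumerate(x))
    print("fraction of empty servers:", x[0])
    print("approximate mean sojourn time:", mean_clients / lam)
```

For interior values of k the flows reproduce the terms of (11) one by one; only the two boundary levels are treated differently, and for large B the difference should be negligible.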

Transient regime. The next theorem states that the approximation is exact over 
finite time-horizons, and is a direct application of Kurtz's theorem, see Chapter 11 
in [9]. 

Theorem 4.3. Assume that $\lim_{m \to \infty} X^{(m)}(0) = x(0)$ almost surely. Fix $t > 0$.
We have: almost surely,

\[ \lim_{m \to \infty} \sup_{u \le t} \|X^{(m)}(u) - x(u)\| = 0, \tag{13} \]

where $x(\cdot)$ is the unique solution of (11) with initial condition $x(0)$.

Proof. First, one can easily represent the family of processes $(X^{(m)}(t), t \ge 0)$ as
a family of density dependent population processes as for example defined in [9].
Then, define $F : \mathbb{R}^{B+3} \to \mathbb{R}^{B+3}$ by: for all $x \in \mathbb{R}^{B+3}$, $F_{-1}(x) = 0 = F_{B+1}(x)$ and,
for all $k = 0, \ldots, B$,

\[ F_k(x) = x_{k-1} \Big( \lambda + \beta \sum_j j x_j \Big) - x_k \Big( \lambda + \beta \sum_j j x_j + \beta k + 1 \Big) + x_{k+1} \big( 1 + \beta (k+1) \big). \]

Now (11) writes $\dot{x} = F(x)$. $F$ is Lipschitz on $\mathcal{T} = \{x \in \mathbb{R}^{B+3} : x_{-1} = 0 =
x_{B+1}, \sum_{k=0}^{B} x_k = 1\}$. As a consequence, the conditions of the theorem stated in [9]
p. 456 are met, and we deduce the expected result. □

Stationary regime. The above theorem holds for finite time-horizons only. It
does not say anything about the long-term behavior of the system, and in particular
about the average stationary client sojourn time. To circumvent this
difficulty we may use the advanced framework formalized by Sznitman [22] and
further developed in [13], and more recently in [6]. Due to space limitations, we skip
all details. We invite the reader either to verify that the results in [6] apply here or to
follow step by step the arguments in [13] to prove the convergence of the steady-state
behavior of finite systems towards the equilibrium point of the dynamical system (11)
when $m \to \infty$. More precisely, denote by $X_{st}^{(m)}$ the stationary empirical distribution
of the system with $m$ servers (such a distribution exists because $(N^{(m)}(t), t \ge 0)$ is an
irreducible finite-state Markov process, and thus positive recurrent).

Theorem 4.4. Assume that from any initial condition in $\mathcal{T}$, the solution of (11)
converges to a unique equilibrium point $\xi$. Then $X_{st}^{(m)}$ converges to $\xi$ when $m \to \infty$.

From the previous theorem, we know that in a system of $m$ servers, the proportion
of servers handling $k$ clients in the stationary regime gets close to $\xi_k$ as $m$
grows large. We may also approximate the average number of clients per server
by $\sum_{k \ge 1} k \xi_k$ and deduce an estimate of the average sojourn time using Little's formula.
It remains to show that the system of differential equations (11) converges
to a unique equilibrium point $\xi$, and to characterize $\xi$.






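Let $\xi$ be a fixed point of (11). Then we easily see that: for all $i = 1, \ldots, B$,

\[ \xi_i = \xi_0 \times \frac{(\lambda + \beta y)^i}{\prod_{j=1}^{i} (1 + \beta j)}, \]

where $y = \sum_j j \xi_j$. $\xi_0$ is obtained so that $\xi$ is a probability measure. Finally, $y$
must solve:

\[ y \times \Big[ 1 + \sum_{i=1}^{B} \frac{(\lambda + \beta y)^i}{\prod_{j=1}^{i} (1 + \beta j)} \Big] = \sum_{i=1}^{B} \frac{i (\lambda + \beta y)^i}{\prod_{j=1}^{i} (1 + \beta j)}. \tag{14} \]

One can check that if $\lambda < 1$, (14) indeed has a unique positive solution $y$: if
$z = \lambda + \beta y$, $z$ must solve $g(z) = 0$ with:

\[ g(z) = (z - \lambda) \Big[ 1 + \sum_{i=1}^{B} \frac{z^i}{\prod_{j=1}^{i} (1 + \beta j)} \Big] - \beta \sum_{i=1}^{B} \frac{i z^i}{\prod_{j=1}^{i} (1 + \beta j)}. \]

The result follows from $g(\lambda) < 0$ and $g'(z) > 0$ for all $z > 0$. In summary the
unique equilibrium point of (11) is $\xi$.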
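The characterization of the equilibrium point lends itself to a direct numerical computation. The Python sketch below is an illustration under stated assumptions rather than the authors' procedure: it finds the root of g by bisection, which relies on g being increasing, then normalizes the product-form weights and applies Little's formula. All function names, the bracketing strategy, and the parameter values are choices made here, and the intensity β is assumed strictly positive.

```python
def weights(z, beta, B):
    """a_i = z**i / prod_{j=1..i} (1 + beta*j) for i = 0, ..., B."""
    a = [1.0]
    for i in range(1, B + 1):
        a.append(a[-1] * z / (1.0 + beta * i))
    return a


def g(z, lam, beta, B):
    """Function whose root z = lam + beta*y characterizes the fixed point."""
    a = weights(z, beta, B)
    return (z - lam) * sum(a) - beta * sum(i * a_i for i, a_i in enumerate(a))


def rlo_equilibrium(lam, beta, B, tol=1e-12):
    """Return (xi, y): equilibrium distribution and mean number of clients per server."""
    lo, hi = lam, lam + 1.0          # g(lam) < 0; enlarge hi until g(hi) > 0
    while g(hi, lam, beta, B) <= 0.0:
        hi *= 2.0
    while hi - lo > tol:             # bisection, valid because g is increasing
        mid = 0.5 * (lo + hi)
        if g(mid, lam, beta, B) < 0.0:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    a = weights(z, beta, B)
    total = sum(a)
    xi = [a_i / total for a_i in a]
    return xi, (z - lam) / beta


if __name__ == "__main__":
    lam, beta, B = 0.8, 0.5, 100     # illustrative values
    xi, y = rlo_equilibrium(lam, beta, B)
    sojourn = y / lam                # Little's formula, per-server arrival rate lam
    print("mean sojourn time:", sojourn)
    print("mean throughput:", 1.0 / sojourn)
```

A quick sanity check on the output is that the fraction of empty servers should be close to 1 − λ when B is large, since in equilibrium the departure rate of a server must match its arrival rate.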

Theorem 4.5. From any initial condition $x(0) \in \mathcal{T}$, if $\lambda < 1$, the system of
differential equations (11) converges to the unique equilibrium point $\xi$.

Proof. The system enjoys the following important monotonicity property. Consider
two initial conditions $x(0)$ and $x'(0)$ such that $x(0) \le_{st} x'(0)$; then if $x$ and $x'$
are the solutions of (11) with respective initial conditions $x(0)$ and $x'(0)$, we have
at any time $t > 0$, $x(t) \le_{st} x'(t)$. The proof of this property is based on a probabilistic
interpretation of the dynamical system (11) as the Kolmogorov equations
of a collection of birth-death processes with birth rate $\lambda + \beta \sum_j j x_j$ and death rate
$(1 + \beta k)$ in state $k$. The idea is that for any $s > 0$, $x(s) \le_{st} x'(s)$ implies that
$\sum_j j x_j(s) \le \sum_j j x'_j(s)$, so the birth rate at time $s$ for $x$ is smaller than that for
$x'$, and by a standard coupling argument, we deduce that just after time $s$, we still
have $x(s^+) \le_{st} x'(s^+)$. We may further deduce that this ordering remains valid
over time.

Denote by $x^{e}$ (resp. $x^{f}$) the solution of (11) when the system is initially empty,
$x^{e}(0) = (1, 0, \ldots, 0)$ (resp. full, $x^{f}(0) = (0, \ldots, 0, 1)$). A direct consequence of the
above monotonicity property is that $x^{e}(t)$ (resp. $x^{f}(t)$) is stochastically increasing
(resp. decreasing) over time. For example, for all $h > 0$, $x^{e}(t + h) \ge_{st} x^{e}(t)$.
This implies that both $x^{e}(t)$ and $x^{f}(t)$ converge to $\xi$ when $t \to \infty$ (since the
equilibrium point is unique). We deduce that such convergence also holds starting
from any initial condition $x(0)$, since again due to the monotonicity property
$x^{e}(t) \le_{st} x(t) \le_{st} x^{f}(t)$ for all $t$. □

$\le_{st}$ denotes the usual strong stochastic order, i.e., if $x, y$ are probability measures on
$\{0, \ldots, m\}$, $x \le_{st} y$ iff for all $j$, $\sum_{i=0}^{j} x_i \ge \sum_{i=0}^{j} y_i$.

4.2.2. RLS algorithm. The large-system approximation method developed above
applies to RLS algorithms. We can similarly derive a deterministic approximation
for the evolution of the system empirical measure $X^{(m)}$. When $m \to \infty$, this
evolution is characterized by: for all $k = 0, \ldots, B$,



\[ \dot{x}_k = \lambda (x_{k-1} - x_k) - (x_k - x_{k+1}) + \beta \Big[ x_{k-1} \sum_{j \ge k+1} j x_j - x_k \sum_{j \ge k+2} j x_j - k x_k \sum_{j \le k-2} x_j + (k+1) x_{k+1} \sum_{j \le k-1} x_j \Big], \tag{15} \]

with the convention that $x_{-2} = x_{-1} = x_{B+1} = x_{B+2} = 0$. Analyzing the dynamical
system (15) is not straightforward and deserves a full study, which we skip here
due to space limitations. In all numerical experiments presented below, we verified
the convergence of (15) to a unique equilibrium point.
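The convergence of the RLS dynamics (15) can be checked numerically along the following lines. This Python sketch is an assumption-laden illustration, not the authors' code: it encodes the rule that a client on a server with j clients moves only to a sampled server holding at most j − 2 clients, expresses the dynamics as flows between neighbouring occupancy levels, and integrates with explicit Euler steps. Function names, step size, parameter values, and the blocking of arrivals at full servers are choices made here.

```python
def rls_drift(x, lam, beta):
    """Mean-field drift under RLS for x = (x_0, ..., x_B), written as flows.

    A client on a server with j clients resamples at rate beta and moves only
    if the sampled server has at most j - 2 clients (strict rate improvement).
    Assumption of this sketch: arrivals to a full server are blocked.
    """
    B = len(x) - 1
    # below[k] = sum_{j < k} x_j and heavy[k] = sum_{j >= k} j * x_j
    below = [0.0] * (B + 2)
    for k in range(B + 1):
        below[k + 1] = below[k] + x[k]
    heavy = [0.0] * (B + 3)
    for k in range(B, -1, -1):
        heavy[k] = heavy[k + 1] + k * x[k]
    dx = [0.0] * (B + 1)
    for k in range(B):                      # k -> k + 1: arrival or incoming migrant
        flow = x[k] * (lam + beta * heavy[k + 2])
        dx[k] -= flow
        dx[k + 1] += flow
    for k in range(1, B + 1):               # k -> k - 1: departure or outgoing migrant
        flow = x[k] * (1.0 + beta * k * below[k - 1])
        dx[k] -= flow
        dx[k - 1] += flow
    return dx


def integrate_rls(x0, lam, beta, t_end, dt=0.01):
    """Explicit Euler integration of the RLS mean-field dynamics."""
    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        dx = rls_drift(x, lam, beta)
        x = [x_k + dt * d_k for x_k, d_k in zip(x, dx)]
    return x


if __name__ == "__main__":
    B, lam, beta = 30, 0.5, 0.5             # illustrative values
    x = integrate_rls([1.0] + [0.0] * B, lam, beta, t_end=200.0)
    mean_clients = sum(k * x_k for k, x_k in enumerate(x))
    print("fraction of empty servers:", x[0])
    print("approximate mean sojourn time:", mean_clients / lam)
```

Running the integration from several initial conditions and comparing the end points is one simple way to probe uniqueness of the equilibrium, although it is of course not a proof.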

4.2.3. Extension to heterogenous systems and arbitrary random walks (for RLO
algorithm). The above asymptotic analysis has been simplified by considering homogenous
systems and uniform random walks (for RLO) only. However, in the case
of the RLS algorithm, it can be easily extended to heterogenous systems,
where the arrival rates and server speeds are not identical. To do so, we may classify
servers according to their arrival rate and speed: servers of the same class have the same
arrival rate and speed. Then, we can derive a set of differential equations, similar
to (11) or (15), approximating the evolution of the proportion of servers of a given
class and handling a given number of clients. We obtain a dynamical system whose
variables $x_{v,k}$ represent the proportion of servers of class $v$ having $k$ clients. In
the case of the RLO algorithm, the analysis may also be extended to arbitrary random
walks; it suffices to include into the server class the rates at which clients jump
towards other servers. For example, servers of class $v$ have the same arrival rate
and speed, and the rate at which a client at one of the class-$v$ servers jumps to a server
of class $v'$ depends on $v$ and $v'$ only. In [6], the authors present such a multi-class
asymptotic analysis in detail.

4.3. Numerical experiments. We now illustrate the results derived in this section
via simple numerical experiments. To evaluate the relative performance of the
RLO and RLS algorithms, we consider first a homogenous system (for all $i$, $\lambda_i = \lambda$,
$\mu_i = 1$), and then an extreme heterogenous system where all clients arrive at the
same server ($\lambda_1 = m\lambda$, and for all $i \ge 2$, $\lambda_i = 0$). The system performance is
expressed in terms of the average client throughput, defined as the inverse of the
average sojourn time.

Figure 1 gives the average client throughput as a function of $\lambda$ in homogenous
systems. We compare the results obtained through the large-system asymptotics
$m = \infty$ and those obtained for $m = 10$ servers. Note that the asymptotic results
are quite accurate even for small systems. Actually at a load of 0.8, the relative
error made in our approximations of the average throughput under the RLO and RLS
algorithms is less than 4% when $m = 5$, and becomes less than 0.5% for $m = 20$.
Note that RLO and RLS are both stable if and only if $\lambda < 1$. Surprisingly, the
performance improvement achieved by the load-dependent RLS algorithm over
the load-oblivious RLO algorithm is not that significant, typically
less than 20%.

Figure 1. Mean throughput under RLS and RLO in homogenous
systems as a function of the load $\lambda$. $\beta = 0.5$.

Figure 2 provides the performance in heterogenous systems with $m = 5$ and
$m = 10$. We provide simulation results only, although, as explained above, we
could have obtained analytic asymptotic results. Again as expected, even if all
clients arrive at the same server, RLS and RLO stabilize the system whenever
possible (when $\lambda < 1$). The difference between the throughput achieved by RLS
and RLO is quite small irrespective of the number of servers considered. Hence
it seems that implementing a load-dependent resampling and migration algorithm
may not significantly improve the performance.
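For readers who wish to reproduce experiments of this kind, the following Python sketch simulates a finite homogenous system with a standard event-driven (Gillespie-type) scheme. It is a hypothetical reconstruction and not the simulator used for the figures: the function name, the seed handling, the processor-sharing interpretation of the unit service rate, the choice to let a client resample its own server (in which case nothing happens), and the use of Little's formula on a finite horizon started from an empty system are all assumptions of this sketch.

```python
import random


def simulate(m, lam, beta, t_end, load_aware, seed=0):
    """Estimate the mean client throughput (inverse mean sojourn time).

    m servers, Poisson arrivals at rate lam per server, each busy server
    completes services at total rate 1, and each client resamples a
    uniformly chosen server at rate beta. With load_aware=True (RLS) the
    client moves only if its rate strictly improves; otherwise (RLO) it
    always moves.
    """
    rng = random.Random(seed)
    n = [0] * m
    t, area = 0.0, 0.0
    while t < t_end:
        total = sum(n)
        busy = [s for s in range(m) if n[s] > 0]
        r_arr, r_dep, r_mig = lam * m, float(len(busy)), beta * total
        rate = r_arr + r_dep + r_mig
        dt = rng.expovariate(rate)
        area += total * min(dt, t_end - t)   # time integral of the population
        t += dt
        u = rng.random() * rate
        if u < r_arr:                        # exogenous arrival
            n[rng.randrange(m)] += 1
        elif u < r_arr + r_dep:              # service completion
            n[rng.choice(busy)] -= 1
        else:                                # resampling by a uniformly chosen client
            i = rng.choices(range(m), weights=n)[0]
            j = rng.randrange(m)
            if j != i and (not load_aware or n[j] + 1 < n[i]):
                n[i] -= 1
                n[j] += 1
    mean_clients = area / t_end
    return lam * m / mean_clients            # Little's formula


if __name__ == "__main__":
    for name, aware in (("RLO", False), ("RLS", True)):
        thr = simulate(m=10, lam=0.8, beta=0.5, t_end=2000.0, load_aware=aware)
        print(name, "estimated mean throughput:", thr)
```

Longer horizons and several independent seeds would be needed before comparing such estimates with the large-system approximations in any quantitative way.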



5. Related work 

There have been many studies on distributed, selfish load balancing algorithms 
and routing games in closed systems, see e.g. [16] and references therein. Refer 
to [H] for a quite exhaustive survey. Much of the work in this area has concentrated 
on finding the fastest sequence of moves that would balance the system, also called 
Nashification [11]. One class of algorithms is the elementary step system, first 
described in [19] in which a sequence of best response moves are performed by the 
clients. Of course this requires that the clients know the status of all the other 
servers. In [4, 5] the authors study closed systems with limited information about
the servers' status. They consider a synchronous system where at each step, each
client samples a new server randomly and if the load of the sampled server is
smaller, then the client moves with probability $(N_c - N_n)/N_c$, where $N_c$ is the load
on the current server and $N_n$ is the load of the sampled server. It is shown that
the expected time to balance the system is $O(\log\log m + n^4)$. A modification of
this load balancing algorithm is studied in [5], and it is shown that the expected
time to balance the system is $O(\log m + n \log n)$. In [T^], the author considers
client dynamics identical to those considered in this paper and uses the potential
function introduced in [10] to quantify the time to achieve system balance. It is
shown that the expected time to reach a balance scales at most as $O(m^2)$. We
provide significant improvements on this bound.
Figure 2. Mean throughput under RLS and RLO in heterogenous
systems as a function of the load $\lambda$. $\beta = 0.5$.



In open systems, the client moves interact in a complicated manner with the
client arrival and departure processes. There is very little work trying to understand
this interaction. None of the existing work deals with a system similar to
that studied here. For instance, [2] analyzes the interaction in a game-theoretical
framework, where arrivals are adversarial, and where a central controller moves
clients with the aim of stabilizing the system. The performance of the classical
work stealing load-balancing scheme has also been studied, see e.g. [3] and references
therein. Of course there is an abundant literature on the performance of
classical load-balancing schemes in open systems where clients are assigned to a
given server for the entire duration of their service, see e.g. the analysis of the
supermarket model in [13, 17]. To our knowledge, the present paper provides the
first analysis of natural distributed resampling and migration strategies in open
systems.

6. Conclusion 

In this paper, we have analyzed the performance of distributed load balanc- 
ing schemes where clients independently decide to resample and change server to 
improve their service rate. We considered two natural random resampling and migration
strategies: a load-oblivious strategy RLO where clients randomly move
from one server to another without accounting for the actual server loads, and a
load-dependent selfish strategy RLS where clients randomly resample servers and
migrate only if their rate is improved.

In closed systems where the population of clients is fixed, we have provided a 
new tight bound on the time to balance server loads under RLS strategy. This time 
can be interpreted as the time to reach a Nash Equilibrium in this selfish routing 
game. Our bound considerably improves the bounds available in the literature. 
But it holds only in the case of homogenous systems where servers have identical 
service rates. It seems challenging and interesting to figure out how to apply our 
methodology to obtain bounds on the time to balance the system in the case of 
heterogenous systems. It might also be interesting to investigate the time it takes 
to balance the system in scenarios where client migrations are limited, in the sense 
that from a given server, clients can migrate to a restricted subset of servers (as for 
example specified via a graph). 

In open systems where clients arrive at the various servers at different rates, we 
provided a first analysis of the system dynamics. These dynamics are complicated 
as the client arrival and departure processes interact with the client migration 
processes. We have shown that both RLO and RLS load balancing strategies are 
able to stabilize the system whenever this is at all possible. It may appear somewhat
surprising that a completely distributed and load-oblivious algorithm such as RLO
can achieve maximum stability. Using large-system asymptotics, we also provided
approximate estimates of the mean client sojourn time. The results show that again,
surprisingly, the load-oblivious RLO strategy does not yield significant performance
losses compared to the load-dependent RLS strategy. These findings are valid for
exponential service requirements, and it would be interesting to know whether they 
remain valid for other service requirement statistics. 

An interesting extension of the present work (especially relevant when consider- 
ing spectrum sharing issues) is to analyze the case where clients may use resources 
from several servers simultaneously. There are some preliminary results in this
direction in [15], but neither the time to reach equilibrium nor the population dynamics
are studied.